Reinforcement Learning (RL) is not merely a branch of computer science; it is a mathematical formalization of the trial-and-error survival strategy used by biological organisms. This slide bridges the gap between the wetware of the brain and the hardware of machines, showing how the "trial-and-error" behavior of a hungry animal corresponds to the value and policy updates of an AI agent.
The Historical Convergence
In 1911, Edward Thorndike proposed the Law of Effect: behaviors followed by satisfaction are likely to recur, while those followed by discomfort are weakened. This became the seed for modern RL's policy updates. Decades later, the **Rescorla-Wagner Model (1972)** mathematically described how animals learn to associate stimuli by minimizing the difference between expected and actual outcomes. Crucially, this biological "prediction error" closely mirrors what we now call the **Temporal Difference (TD) Error** in reinforcement learning.
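The Rescorla-Wagner update can be sketched in a few lines: the associative strength of a stimulus is nudged toward the actual outcome in proportion to the prediction error. The parameter names (`alpha`, `lam`, `n_trials`) are illustrative assumptions, not from the source.

```python
# Rescorla-Wagner learning: a minimal sketch.
# Delta-V = alpha * (lam - V), where (lam - V) is the prediction error.

def rescorla_wagner(n_trials=50, alpha=0.3, lam=1.0):
    """Learn the associative strength V of a single stimulus.

    alpha: learning rate (salience of the stimulus) -- assumed value
    lam:   asymptote of learning (magnitude of the outcome, e.g. food)
    """
    v = 0.0
    history = []
    for _ in range(n_trials):
        prediction_error = lam - v   # "surprise": actual minus expected outcome
        v += alpha * prediction_error
        history.append(v)
    return history

strengths = rescorla_wagner()
# V rises steeply at first, then flattens as the prediction error shrinks --
# the classic negatively accelerated learning curve seen in conditioning data.
```

Note how learning stops on its own: once the animal's expectation matches the outcome, the error term (and thus the update) goes to zero.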
From Pavlov to the Gridworld
- Classical (Pavlovian) Conditioning: Predicting the world. An organism learns that a stimulus (Bell) predicts an outcome (Food). This is equivalent to learning a **Value Function** $V(s)$.
- Instrumental (Operant) Conditioning: Controlling the world. An organism learns that an action (lever press) in a specific context leads to a reward. This maps directly to the **Agent-Environment interaction loop**.
- Intrinsic Rewards: Unlike machines, biological organisms are driven by internal survival needs (homeostasis). We can draw on these biological drives to design better reward functions that discourage reward hacking in AI systems.
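The agent-environment loop and the value function from the bullets above can be combined in a minimal sketch: tabular TD(0) learning $V(s)$ on a tiny one-dimensional "gridworld". All names and constants here (`N_STATES`, `ALPHA`, `GAMMA`, the corridor environment) are illustrative assumptions, not from the source.

```python
# A minimal agent-environment loop: TD(0) value learning on a 5-state corridor.
# The agent wanders randomly; reaching the rightmost state yields reward 1.
import random

random.seed(0)        # for reproducibility of this sketch
N_STATES = 5          # states 0..4; state 4 is terminal
ALPHA, GAMMA = 0.1, 0.9

def step(state, action):
    """Environment: action -1 moves left, +1 moves right (walls clamp)."""
    nxt = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    done = nxt == N_STATES - 1
    return nxt, reward, done

V = [0.0] * N_STATES
for episode in range(500):
    s, done = 0, False
    while not done:
        a = random.choice([-1, 1])            # random policy (pure exploration)
        s2, r, done = step(s, a)
        td_error = r + GAMMA * V[s2] - V[s]   # the TD error: the machine
        V[s] += ALPHA * td_error              # analogue of the dopamine signal
        s = s2
```

After training, states closer to the food reward carry higher value, just as a stimulus closer in time to the food elicits a stronger conditioned response.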